ABSTRACT


Introduction: Demand for machine learning is growing in almost every critical area as a way to uncover interesting patterns that
support decision-making. The objective of this inductive research is to investigate different sophisticated machine learning
techniques for analyzing health data effectively. Health-related data is sensitive and crucial and needs accurate analysis, so
the results generated by different learning algorithms are of paramount importance. This sensitivity has increased interest in
data analytics through machine learning in the health sector and promoted its role.





Objectives: This research aims to analyze and predict diabetes by applying learning algorithms to a diabetes dataset.
The article also presents a comparative analysis of the algorithms.


Methods: This research uses the median method to preprocess the dataset. After preprocessing, ten different machine learning 
algorithms are applied to the diabetes dataset in this paper. 


Results: This study uses a diabetes dataset with eight different symptoms, or features, to predict the disease. To identify the
better classification technique, the results of the various ML methods are compared and analyzed. The outcome of this study can
be used in future research on diabetes-related health problems.


Conclusion: A linear support vector machine shows better detection results compared to others. 


Key Words: Machine Learning, Predictive Analysis, Gaussian Process, Diabetes Prediction, SVM, Decision Tree, Nearest 
Neighbor 


INTRODUCTION


In the health sector, the incidence of diabetes is increasing globally at a fast pace and has become a
major concern. Detecting or predicting diabetes is therefore of paramount importance. Machine Learning
(ML) is now widely used to discover patterns in data, and the sensitivity of the health sector motivates
researchers to apply machine learning there. Machine learning has been adopted in many areas to work on
data-driven problems. However, applying machine learning to a problem requires knowledge of the data.
Machine learning offers several algorithms, and these techniques give different results on different
datasets. A dataset may contain null values, incorrect information, and incomplete data, so preprocessing
of the dataset is required to get good results. Preprocessing improves performance, and hence prediction
can be done smoothly. This research uses the median method to preprocess the dataset. After
preprocessing, ten different machine learning algorithms are applied to the diabetes dataset in this
paper. This document also presents the comparative study and analysis performed on the dataset using
these 10 learning classifiers. The Nearest Neighbor,¹ Linear SVM,² RBF SVM,³ Gaussian Process, Decision
Tree, Random Forest,⁴ MLP Classifier, Adaboost, Naive Bayes, and QDA machine learning classifiers are
used in this research. These classifiers are implemented on the diabetes dataset and then their results
are compared and analyzed. In this paper, these classifiers predict diabetes based on eight different
features. The features included in this study are Pregnancies, Glucose, Blood Pressure, Skin Thickness,
Insulin, BMI, Diabetes Pedigree Function, and Age.

In medical science, diabetes is one of the chronic and rising diseases that need special attention.
Keeping in mind the need to detect diabetes, researchers are directing their efforts towards this field.
The seriousness of health-sector problems increases the concern to predict or detect diabetes based on
features or symptoms. With changes in society's way of living, eating, and physical work, the likelihood
of diabetes has risen drastically. The incurability of this disease makes its prediction even more
essential. However, proper diet, exercise, and precautions are needed to reduce and control the
complications and ill effects of the disease. The negative impact of this disease on the patient can even
lead to death, and if proper care and precautions are not taken in time, it can badly affect health. This
paper predicts diabetes based on eight different symptoms or features; that is, it predicts whether a
patient is diabetic or not based on all of the features mentioned above. For prediction, this paper uses
different machine learning algorithms and compares the potential of the classifiers based on their
accuracy. This article also describes and analyses the machine learning classification algorithms.





In this direction, work is going on continually. Daniel et al. use parallel and distributed computation
techniques, as well as deep learning techniques, for efficient and proper analysis of health care data.
The paper focuses on establishing a relationship between the variables of medical and laboratory
assessments and the occurrence of adverse events. Another work reviews the use of deep learning in the
health sector; it provides an analysis of the relative merits and pitfalls of the techniques, along with
a future outlook. The main emphasis of that research is on key deep learning applications such as public
health, medical imaging, etc.





Tian et al.⁹ analyze trunk sway data with machine learning techniques to evaluate balance automatically
and provide accurate assessment outside the clinic. A systematic, algorithm-independent approach for
mounting poisoning attacks across machine learning algorithms and datasets in the health sector has also
been presented.¹⁰ An approach for improving postprandial glucose regulation using the ML-based KNN
method has been proposed.¹¹ The ML approach is useful in many applications. The authors of Ref. 12
investigate sophisticated machine learning techniques for developing personalized models that detect the
risk of non-fatal as well as fatal CVD incidence in T2DM patients.


The authors of Ref. 13 assess the association between type 2 diabetes and the HW phenotype, and evaluate
the predictive power of TG levels combined with anthropometric measurements in Korean adults. PhysOnline
is presented in Ref. 14. It is a pipeline built on Apache Spark, an open-source platform, that works on
streaming physiological data to extract features online and to apply machine learning.


Wen et al.¹⁵ analyzed the detectability of microaneurysms using pixel patches of size 25 by 25
extracted from the fundus images in the DIAbetic RETinopathy database, i.e., DIARETDB1.


MATERIALS AND METHODS 


In the biomedical field, diabetes is among the most serious and critical diseases for human life.¹⁶
Because health data is directly related to public life, it is critical, and that is why the health
industry needs additional and special attention. Health-related data is available in vast amounts, is
growing continually, and is becoming much more complex. This enormous volume of data opens an opportunity
for researchers to analyze it and, based on that analysis, to try to predict or detect disease. This
article uses the diabetes dataset and predicts diabetes based on the features available in it. Diabetes
is one of the rising diseases and affects a large number of people. It drastically affects life, harms
health, and can even shorten an affected person's life. The incurable nature of this disease makes its
prediction more vital. The disease can be managed by taking appropriate precautions and exercising in a
timely and correct manner, which is another important reason why its prediction is necessary.¹⁷,¹⁸ This
research applies prediction algorithms to the diabetes dataset and shows the results. The dataset is
taken from Kaggle. It consists of 768 records with eight different symptoms, or features: pregnancies,
glucose, blood pressure, skin thickness, insulin, BMI, diabetes pedigree function, and age. The disease
is predicted based on these symptoms. The main concern with any dataset is the presence of null values,
irrelevant data, noisy data, etc. All of these problems can lead to incorrect predictions, which in turn
lead to wrong judgments, and a wrong decision ultimately affects the complete system badly. To overcome
this problem, the dataset is preprocessed.








Preprocessing is a way of obtaining a dataset with correct values, which helps in achieving better
results than unprocessed data. Table 1 briefly lists pre-processing steps that can be used with any
dataset to make the data suitable for further classification or prediction tasks. The dataset used in
this research is the diabetes dataset, and the learning models are trained on it to predict diabetes.
Before applying learning algorithms, it is necessary to pre-process the data.¹⁹,²⁰,²¹ After studying the
dataset, it is observed that many cells have a value of zero. These zeros are nothing but null or missing
values. This research replaces zero, error, and null values with the median of the corresponding feature.
Table 2 and Figure 1 show all eight features of the dataset, with their null-value counts, before
pre-processing. The null-value counts after the median replacement are shown in Table 3.
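The median replacement described above can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function name, the two sample columns, and the sample values are hypothetical, and computing the median over the non-zero entries only is an assumption, since the text does not state which entries enter the median.

```python
from statistics import median


def impute_zeros_with_median(rows, columns):
    """Return a copy of `rows` in which zero/None cells of `columns` hold the column median.

    Assumption: the median is taken over the valid (non-zero, non-None) cells only.
    """
    medians = {}
    for col in columns:
        valid = [row[col] for row in rows if row[col] not in (0, None)]
        # If a column has no valid cell at all, fall back to 0 (nothing to impute from).
        medians[col] = median(valid) if valid else 0

    cleaned = []
    for row in rows:
        new_row = dict(row)  # copy so the input records are left untouched
        for col in columns:
            if new_row[col] in (0, None):
                new_row[col] = medians[col]
        cleaned.append(new_row)
    return cleaned


# Hypothetical records with two of the eight features; 0 marks a missing value.
sample = [
    {"Glucose": 148, "BMI": 33.6},
    {"Glucose": 0, "BMI": 26.6},
    {"Glucose": 183, "BMI": 0},
    {"Glucose": 89, "BMI": 28.1},
]
cleaned = impute_zeros_with_median(sample, ["Glucose", "BMI"])
for row in cleaned:
    print(row)
```

Using the median rather than the mean keeps the replacement value robust to extreme measurements, which is a common reason for choosing it on skewed clinical features.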





The research uses a diabetes dataset of eight features and applies different machine learning models
for analysis.¹⁷,¹⁸ These models are Nearest Neighbor, Linear SVM, RBF SVM, Gaussian Process, Decision
Tree, Random Forest, MLP classifier, Adaboost, Naive Bayes, and QDA. Table 4 briefly explains the
learning models used in this research.
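As an illustration of how one of these models operates, a nearest neighbor classifier can be sketched in a few lines: a query point receives the majority label of its k closest training points under a distance function. The two features (a glucose value and a BMI value), the sample points, the labels, and k = 3 are hypothetical and are not the configuration used in the study.

```python
from collections import Counter
from math import dist


def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Pair each training point with its label and sort by Euclidean distance to the query.
    neighbors = sorted(zip(train_x, train_y), key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical (glucose, BMI) points; label 1 = diabetic, 0 = non-diabetic.
train_x = [(85.0, 22.0), (90.0, 24.5), (95.0, 26.0),
           (160.0, 35.0), (170.0, 38.5), (180.0, 33.0)]
train_y = [0, 0, 0, 1, 1, 1]

high_query = knn_predict(train_x, train_y, (165.0, 36.0))
low_query = knn_predict(train_x, train_y, (88.0, 23.0))
print(high_query, low_query)
```

In practice the features would usually be scaled before computing distances, because a feature with a large numeric range otherwise dominates the vote.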


RESULTS 


This study is developed to detect diabetes early in a patient on the strength of eight different
symptoms, so that the required precautions can be taken at the right time. Diabetes is a disease
that needs special attention because of its incurable nature, and this gives diabetes prediction
paramount importance.





The results achieved in the experiment are shown in Table 5 and analyzed graphically in Figure 2.
For the experiment, the diabetes dataset is first preprocessed, and then 10 different learning
models are applied to it. The potential of these models is analyzed based on accuracy.





The accuracy of each learning model is shown numerically in Table 5 and graphically in Figure 2,
which make it easy to understand and compare the results of each classifier. A confusion matrix of
the results is shown in Table 6.
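For a binary problem such as this one, accuracy follows directly from the four cells of a confusion matrix: the correctly classified records divided by all records. The sketch below shows that calculation; the function name and the counts are hypothetical and are not taken from Table 6.

```python
def accuracy_from_confusion(tn, fp, fn, tp):
    """Accuracy = (true positives + true negatives) / all classified records."""
    total = tn + fp + fn + tp
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


# Hypothetical counts for a held-out set of 154 records.
acc = accuracy_from_confusion(tn=90, fp=10, fn=25, tp=29)
print(f"accuracy = {acc:.4f}")
```

Accuracy alone does not distinguish missed diabetic cases (false negatives) from false alarms, which is why the confusion matrix is reported alongside it.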


DISCUSSION 


The diabetes detection results are shown in the previous section. After a thorough analysis of
Table 5 and Figure 2, it can be concluded that the linear support vector machine shows a better
detection result than the other classifiers. The higher accuracy of linear SVM indicates that the
potential of this classifier on the diabetes dataset is high. The results lead to the following
conclusions:


1. The linear support vector machine shows good results on a small dataset and is
   infeasible for a dataset with a large number of records. The dataset in this
   research is not too large.

2. This classifier is good for binary classification on a non-sparse dataset. The
   outcome label of the dataset used in this research is either diabetic or
   non-diabetic, that is, binary.

3. The classification output of this model is better for linearly separable data,
   and this paper used a linear dataset.

4. This model can work well with a small number of parameters. The dataset in this
   research has eight features.

5. This technique does not work well when heavy computation is required, and the
   dataset of this research needs less computation.
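The comparison behind these conclusions can be sketched by ranking the classifiers by accuracy. The values below are transcribed from Table 5 as percentages; the shortened model labels and the variable names are illustrative assumptions.

```python
# Accuracy (%) per classifier, transcribed from Table 5.
accuracy_percent = {
    "Nearest Neighbor": 71.15,
    "Linear SVM": 77.93,
    "RBF SVM": 65.31,
    "Gaussian Process": 67.07,
    "Decision Tree": 70.03,
    "Random Forest": 74.9,
    "MLP": 67.6,
    "Adaboost": 73.3,
    "Naive Bayes": 73.79,
    "QDA": 73.95,
}

# Sort from the highest to the lowest accuracy.
ranked = sorted(accuracy_percent.items(), key=lambda item: item[1], reverse=True)
best_model, best_accuracy = ranked[0]

for name, value in ranked:
    print(f"{name:<18}{value:6.2f}")
print("best:", best_model)
```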


CONCLUSION 


A health-related issue like diabetes is one of the important diseases that need paramount
attention. The sensitivity of this disease has increased and promoted interest in analyzing it
through machine learning. This study analyses diabetes using ML algorithms and finds that the
linear support vector machine shows better detection results than the others.

Pre-Processing

Second Step   Data Integration
Third Step    Data Transformation   It is a way of transforming or changing data into the required format.
Fourth Step   Data Reduction        In order to become cost efficient, reduce the data to a small size.

Table 3: Dataset Features and Number of Null Values (After pre-processing)

Column/Feature Name          Number of Null Values
Pregnancies                  0
Glucose                      0
BloodPressure                0
SkinThickness                0
Insulin                      0
BMI                          0
Diabetes Pedigree Function   0
Age                          0


Table 4: Machine Learning Models Brief Description

Machine Learning Models                              Explanation
Nearest Neighbor                                     Supervised learning method used to recognize patterns; works as per a distance function.
Linear Support Vector Machine                        Data points are classified by finding the hyperplane in space.
RBF (Radial basis function) Support Vector Machine   RBF is a kernel function used for classification with SVM.
Gaussian Process                                     Used for regression along with classification.
Decision Tree                                        Performs classification and represents it in tree form.
Random Forest                                        Combination of multiple decision trees.
(Multi Layer) MLP classifier                         Artificial neural network supervised learning classifier.
Adaboost                                             Meta algorithm for classification.
Naive Bayes                                          Probabilistic Bayes-based classifier.
(Quadratic Discriminant Analysis) QDA classifier     Technique of machine learning for classification.


Table 5: Machine Learning Models with their Gained Accuracy

Machine Learning Models                              Accuracy Gained (%)
Nearest Neighbor                                     71.15
Linear Support Vector Machine                        77.93
RBF (Radial basis function) Support Vector Machine   65.31
Gaussian Process                                     67.07
Decision Tree                                        70.03
Random Forest                                        74.9
(Multi Layer) MLP classifier                         67.6
Adaboost                                             73.3
Naive Bayes                                          73.79
(Quadratic Discriminant Analysis) QDA classifier     73.95



Figure 1: Dataset features v/s Number of Null Values (Before Pre-Processing).


Figure 2: Machine Learning Models v/s Accuracy.